Human trafficking has long been a major problem around the world, and it has devastating effects on its victims. With new databases we can examine risk factors in order to help mitigate this problem and counter the abusers. In this project I will be examining the age of the victims, the victim's relation to the abuser, the type of abuse used for trafficking, and whether the victim was abducted or not. The data that I will use comes from the CTDC, contains information from all over the world, and dates back to 2002. To keep the scope reasonable, the predictions will only be made on United States cases, which range from 2015 to 2018. Using machine learning we can predict whether abusers are more violent or have a closer relation to the victim based on factors like age.
This project will be completed in Python using the pandas, numpy, scikit-learn, seaborn, matplotlib, and folium libraries.
import pandas as pd
import seaborn as sea
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import numpy as np
import folium
from folium.plugins import MarkerCluster
import warnings
warnings.filterwarnings('ignore')
The format of the data needs to be altered to deal with missing values. The dataset uses -99 as a placeholder for missing data, but that will skew our results when we attempt any regression modeling. To fix this we have to edit the actual data file and replace every -99 with 0. This also gives us a pseudo-binary flag for each indicator column, with 0 being false and 1 being true. The data set can be found here [TraffickingData](https://www.ctdatacollaborative.org/dataset/resource/511adcb7-b1a2-4cc7-bf2f-0960d43a49cc).
The file is in CSV (comma-separated values) format, so we can use the built-in pandas parser to create a data frame.
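The same replacement could also be done in pandas rather than by editing the file by hand. The sketch below is only an illustration on a small invented frame; the name `raw` and the sample values are assumptions, and the column names are borrowed from the dataset. Numeric placeholders become 0 and text placeholders become the string "0".

```python
import pandas as pd

# Hypothetical miniature of the raw file: -99 marks a missing value.
raw = pd.DataFrame({
    "majorityStatusAtExploit": ["Minor", "-99", "Adult"],
    "meansOfControlThreats": [1, -99, -99],
})

# Replace the numeric placeholder first, then the text placeholder.
cleaned = raw.replace(-99, 0).replace("-99", "0")
print(cleaned)
```

Doing the replacement in code keeps the original download untouched, so the cleaning step can be repeated whenever the dataset is updated.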
globalFrame = pd.read_csv("trafficking.csv")
This dataframe is much too large to interpret as it stands, so we must get rid of unrelated columns. In addition, I renamed most of the columns to better fit the data they hold. Shorter column names also make the dataframe easier to read. I also changed how a missing entry appears in the majorityStatus column, as unknown is more fitting than 0.
In order to count cases we add a new column in which every entry is one. Then we can group by both majorityStatus and year to create a new dataframe. This dataframe is indexed by age status and year, and its columns hold the totals for cases and for each type of control, relation, etc. We can plot the Cases column as a bar chart, as shown below, which shows the vast difference in case counts based on majority status.
globalFrame = globalFrame.drop(globalFrame.columns[0],axis = 1)
globalFrame = globalFrame.drop(["Datasource","ageBroad","majorityStatus","majorityEntry"],axis = 1)
globalFrame = globalFrame.rename(columns = {"yearOfRegistration":"year","majorityStatusAtExploit":"majorityStatus","meansOfControlDebtBondage":"DebtBondage",
"meansOfControlTakesEarnings":"EarningsStolen","meansOfControlRestrictsFinancialAccess": "WithholdsMoney",
"meansOfControlThreats":"Threats","meansOfControlPsychologicalAbuse":"PsychologicalAbuse",
"meansOfControlPhysicalAbuse":"PhysicalAbuse","meansOfControlSexualAbuse":"SexualAbuse",
"meansOfControlFalsePromises":"FalsePromises","meansOfControlPsychoactiveSubstances":"PsychoactiveSubstances",
"meansOfControlRestrictsMovement":"RestrictsMovement","meansOfControlRestrictsMedicalCare":"RestrictsMedicalCare",
"meansOfControlExcessiveWorkingHours":"ExcessiveWorkingHours","meansOfControlUsesChildren":"UsesChildren",
"meansOfControlThreatOfLawEnforcement":"ThreatOfLawEnforcement","meansOfControlWithholdsNecessities":"WithholdsNecessities",
"meansOfControlWithholdsDocuments":"WithholdsDocuments","meansOfControlOther":"OtherControl","meansOfControlNotSpecified":"ControlNotSpecified",
"recruiterRelationIntimatePartner":"IntimatePartner","recruiterRelationFriend":"Friend","recruiterRelationFamily":"Family",
"recruiterRelationOther":"OtherRelation","recruiterRelationUnknown":"UnknownRelation"})
frame = globalFrame[globalFrame['citizenship'] == "US"]
frame = frame.reset_index(drop = True)
frame.loc[frame['majorityStatus'] == '0', 'majorityStatus'] = "unkown"
frame.head()
frame["Cases"] = 1
ageFrame = frame.groupby(["majorityStatus","year"]).sum(numeric_only = True)
ageFrame['Cases'].plot.bar()
ageFrame = ageFrame.reset_index()
ageFrame
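The grouping step is easier to follow on a tiny frame. The sketch below uses invented rows (the name `toy` and its values are assumptions for illustration) and applies the same pattern: add a column of ones, group by majority status and year, and sum.

```python
import pandas as pd

# Invented rows standing in for individual cases.
toy = pd.DataFrame({
    "majorityStatus": ["Minor", "Minor", "Adult", "Minor"],
    "year": [2015, 2015, 2015, 2016],
    "Threats": [1, 0, 1, 1],
})

# Every row is one case, so summing this column counts cases per group.
toy["Cases"] = 1
summary = (
    toy.groupby(["majorityStatus", "year"])
    .sum(numeric_only=True)
    .reset_index()
)
print(summary)
```

Each output row is one (majorityStatus, year) pair, with `Cases` holding the number of rows in that group and `Threats` holding how many of them involved threats.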
To form a good hypothesis for the predictions we must first look at the values. First we again edit the frame to get rid of data we cannot use. To do this we simply set ageFrame equal to ageFrame where the majorityStatus column is not unknown. Then, using seaborn, we can see the differences in the types of control used on adults and minors. The hue parameter lets us split the points by a second variable, and seaborn will automatically color and label them on the graph. Seaborn is also made much easier by the data parameter, which lets the user pass column names directly as x and y. The clf() function simply prevents seaborn from drawing on the same figure in later calls. Based on the graphs, minors account for more of the reported cases than adults for most types of control.
ageFrame = ageFrame[ageFrame['majorityStatus'] != "unkown"]
physicalAbuse = sea.scatterplot(x = "year",y = "PhysicalAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
sexualAbuse = sea.scatterplot(x = "year",y = "SexualAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
psychologicalAbuse = sea.scatterplot(x = "year",y = "PsychologicalAbuse",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
psychoactiveDrugs = sea.scatterplot(x = "year",y = "PsychoactiveSubstances",hue = "majorityStatus",data = ageFrame)
plt.show()
plt.clf()
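Raw totals also depend on how many cases were reported for each group, so a per-case rate can be a fairer comparison than a count. The following is a minimal sketch with invented numbers; `toyAge` and `PhysicalAbuseRate` are hypothetical names, and the frame is only shaped like ageFrame.

```python
import pandas as pd

# Invented yearly totals shaped like ageFrame.
toyAge = pd.DataFrame({
    "majorityStatus": ["Adult", "Minor"],
    "year": [2016, 2016],
    "PhysicalAbuse": [10, 30],
    "Cases": [50, 300],
})

# Share of each group's cases that involved physical abuse.
toyAge["PhysicalAbuseRate"] = toyAge["PhysicalAbuse"] / toyAge["Cases"]
print(toyAge[["majorityStatus", "year", "PhysicalAbuseRate"]])
```

In this invented example minors have more physical abuse cases in total, yet a smaller share of their cases involve it, which is why it can help to look at both views.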
To begin we first need to filter the original frame so that no unknown data skews our predictions. Next we create a column holding a binary flag that shows whether the victim was a minor. This column is what we are attempting to predict. We start by predicting it from the year alone, using a new frame that includes only the year and the majority status flag. Here we use the LinearRegression model from sklearn to fit a best-fit line to our data. After calling fit, with X being the year column and y being the majority status flag, we can predict the majority status of a victim in a given year. As shown below, when 2017 is the input the output is close to 1, meaning the victim is most likely a minor. However, this is a poor prediction model because only the year is taken into account. The next step is adding in our other variables.
predictFrame = frame[frame["majorityStatus"] != "unkown"].copy()
# Binary flag: 1 if the victim was a minor, 0 otherwise
predictFrame["BinaryAge"] = (predictFrame["majorityStatus"] == "Minor").astype(int)
predictAge = predictFrame[["year","BinaryAge"]]
model = LinearRegression()
model.fit(X = predictAge.drop("BinaryAge",axis = 1),y = predictAge["BinaryAge"])
model.predict(np.array([[2017]]))
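When reading these outputs, it helps to remember that linear regression on a 0/1 flag returns a score on a straight line, not a probability. The sketch below uses four invented data points (`years`, `isMinor`, and `toyModel` are hypothetical names) to show that the score can leave the 0 to 1 range once we extrapolate beyond the years in the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: year of registration and a 0/1 minor flag.
years = np.array([[2015], [2015], [2016], [2016]])
isMinor = np.array([0, 1, 1, 1])

toyModel = LinearRegression().fit(years, isMinor)

# The fitted line rises by 0.5 per year, so 2017 lands above 1.
score = toyModel.predict(np.array([[2017]]))[0]
print(round(score, 3))
```

A value near or above 1 can still be read as "most likely a minor", but it should be treated as a rough score rather than a calibrated likelihood.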
Here we will attempt to predict the majority status of a victim based on the type of control the abuser used. This can be done by making a new dataframe that includes the type of control columns that we renamed earlier. Again we use the entire data frame except our binary age flag as X and the age flag as y. After the linear regression fit, our prediction model expects 18 inputs: the year followed by 17 control flags. To test a prediction, build a NumPy array with the year first, a 1 for the type of control being tested, and a 0 for every other type. As shown below we have a 1 in the fifth column, so this is a prediction for when threats are used for control. Using the prediction model we can see that, based on this type of control and the year it happened, the victim is likely a minor.
predictAbuse = predictFrame[["year","DebtBondage","EarningsStolen","WithholdsMoney","Threats","PhysicalAbuse","SexualAbuse","FalsePromises",
"PsychoactiveSubstances","RestrictsMovement","RestrictsMedicalCare","ExcessiveWorkingHours",
"UsesChildren","ThreatOfLawEnforcement","WithholdsNecessities","WithholdsDocuments",
"OtherControl","ControlNotSpecified","BinaryAge"]]
model = LinearRegression()
model.fit(X = predictAbuse.drop("BinaryAge",axis = 1),y = predictAbuse["BinaryAge"])
print("Prediction (1 = minor) for a victim controlled using threats in 2017:",model.predict(np.array([[2017,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]])))
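Typing 18 positional values by hand is easy to get wrong, so the input row can instead be built from column names. The helper below, `makeInputRow`, is a hypothetical function written for illustration and is not part of any library; `featureColumns` is assumed to list the model's inputs in the same order as the frame used for fitting.

```python
import numpy as np

# Must match the column order of the frame the model was fitted on.
featureColumns = [
    "year", "DebtBondage", "EarningsStolen", "WithholdsMoney", "Threats",
    "PhysicalAbuse", "SexualAbuse", "FalsePromises", "PsychoactiveSubstances",
    "RestrictsMovement", "RestrictsMedicalCare", "ExcessiveWorkingHours",
    "UsesChildren", "ThreatOfLawEnforcement", "WithholdsNecessities",
    "WithholdsDocuments", "OtherControl", "ControlNotSpecified",
]

def makeInputRow(year, activeControls, columns=featureColumns):
    """Return a 1 x len(columns) array: the year, then 1 for each named control."""
    unknown = set(activeControls) - set(columns)
    if unknown:
        raise ValueError(f"Unknown control columns: {sorted(unknown)}")
    row = [year if name == "year" else int(name in activeControls)
           for name in columns]
    return np.array([row])

row = makeInputRow(2017, ["Threats"])
print(row)
```

The resulting array has the same layout as the hand-written one and could be passed to `model.predict` directly.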
Another question we can ask is whether the majority status of the victim can be predicted from the recruiter's relation to them. To do this we make a new frame of the year, the recruiter relation columns that we renamed earlier, and the age flag. Once again X is our frame minus the flag and y is the flag itself that we are trying to predict. After the fit, the prediction model expects 5 elements, which are handled the same way as before. We can see through the predictions below that if the recruiter is a family member the victim is almost always a minor, and if the recruiter is an intimate partner the victim is likely to be a minor.
predictRecruit = predictFrame[["year","IntimatePartner","Family","OtherRelation","UnknownRelation","BinaryAge"]]
model = LinearRegression()
model.fit(X = predictRecruit.drop("BinaryAge",axis = 1),y = predictRecruit["BinaryAge"])
print("Prediction (1 = minor) for a victim recruited by a family member in 2017:",model.predict(np.array([[2017,0,1,0,0]])))
print("Prediction (1 = minor) for a victim recruited by an intimate partner in 2015:",model.predict(np.array([[2015,1,0,0,0]])))
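A quick way to sanity-check regression outputs like these is to compare them with the plain share of minors in each group. The sketch below uses invented rows; `toyRecruit` is a hypothetical frame with the same `Family` and `BinaryAge` column names.

```python
import pandas as pd

# Invented rows: whether the recruiter was family, and the 0/1 minor flag.
toyRecruit = pd.DataFrame({
    "Family": [1, 1, 0, 0, 0],
    "BinaryAge": [1, 1, 1, 0, 0],
})

# Mean of a 0/1 flag is the fraction of minors in each group.
minorShare = toyRecruit.groupby("Family")["BinaryAge"].mean()
print(minorShare)
```

If the regression score for a relation type is far from the corresponding group share, that is a hint to look more closely at the model inputs.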
To display the vast amount of data in the file in a readable format, we can use a map. This map shows the entire world, highlights the countries that appear in our dataframe, and labels them with total cases. To accomplish this we use folium and a folium plugin called MarkerCluster(). First we initialize the map, which we simply call m, as a new folium map. Then a MarkerCluster(), called cluster here, is added to the map. The next step is to iterate over the data and check the victim's citizenship column to see which country to add a marker at. To add a marker we use the Marker() function, which takes a location in the form [lat,long] and can display a popup, here set to the country we are adding the marker to. As markers are added to the cluster, it automatically absorbs any nearby markers and displays a total number. Not only does this make the data easier to read, it also helps with large datasets that could otherwise crash your computer. The map is also interactive, so users can move around and zoom in on specific countries to see their total reported cases.
m = folium.Map()
cluster = MarkerCluster().add_to(m)
# Approximate location of each country as [lat,long], keyed by citizenship code
countryCoords = {
    "CO": [4.5709,-74.2973], "MD": [47.4116,28.3699], "RO": [45.9432,24.9668],
    "UA": [48.3794,31.1656], "BY": [53.7098,27.9534], "HT": [18.9712,-72.2852],
    "UZ": [41.3775,64.5853], "LK": [7.8731,80.7718], "MM": [21.9162,95.9560],
    "UG": [1.3733,32.2903], "ID": [-0.7893,113.9213], "KG": [42.882004,74.582748],
    "AF": [33.9391,67.7100], "ER": [15.1794,39.7823], "NG": [9.0820,8.6753],
    "NP": [28.3949,84.1240], "PH": [12.8797,121.7740], "KH": [12.5657,104.9910],
    "BD": [23.6850,90.3563], "US": [37.0902,-95.7129], "TH": [15.8700,100.9925],
    "VN": [14.0583,108.2772],
}
# Add one marker per case; the cluster merges nearby markers and shows a count
for code in globalFrame["citizenship"]:
    if code in countryCoords:
        folium.Marker(location = countryCoords[code],popup = code).add_to(cluster)
m
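An alternative to adding one marker per case is to count the cases per country first and place a single marker for each country. The sketch below covers only the counting step, on an invented citizenship column; `citizenship`, `coords`, and `markers` are hypothetical names, and the coordinates are taken from the map code above.

```python
import pandas as pd

# Invented sample of the citizenship column.
citizenship = pd.Series(["US", "US", "UA", "PH", "UA", "US"], name="citizenship")

# Approximate [lat, long] for the countries in the sample.
coords = {
    "US": [37.0902, -95.7129],
    "UA": [48.3794, 31.1656],
    "PH": [12.8797, 121.7740],
}

# One row per country: code, number of cases, and where to put the marker.
counts = citizenship.value_counts()
markers = pd.DataFrame({"code": counts.index, "cases": counts.values})
markers = markers[markers["code"].isin(coords)].copy()
markers["lat"] = markers["code"].map(lambda c: coords[c][0])
markers["long"] = markers["code"].map(lambda c: coords[c][1])
print(markers)
```

Each row of `markers` could then be passed to `folium.Marker` with `location = [lat, long]` and a popup that includes the case count, which keeps the number of markers equal to the number of countries.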